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Abstract 

We derive a new representation for a function as a linear combination of local correlation 
kernels at optimal sparse locations and discuss its relation to PCA, regularization, sparsity 
principles and Support Vector Machines. We also discuss its Bayesian interpretation and 
justification. 

We first review previous results for the approximation of a function from discrete data 
(Girosi, 1998) in the context of Vapnik's feature space and dual representation (Vapnik, 
1995). We apply them to show 1) that a standard regularization functional with a stabilizer 
defined in terms of the correlation function induces a regression function in the span of the 
feature space of classical Principal Components and 2) that there exist a dual representations 
of the regression function in terms of a regularization network with a kernel equal to a 
generalized correlation function. We then describe the main observation of the paper: the 
dual representation in terms of the correlation function can be sparsified using the Support 
Vector Machines (Vapnik, 1982) technique and this operation is equivalent to sparsify a 
large dictionary of basis functions adapted to the task, using a variation of Basis Pursuit 
De-Noising (Chen, Donoho and Saunders, 1995; see also related work by Donahue and 
Geiger, 1994; Olshausen and Field, 1995; Lewicki and Sejnowski, 1998). 

In all cases - regularization, SVM and BPD - we show that a bayesian approach jus- 
tifies the choice of the the correlation function as kernel. In addition to extending the 
close relations between regularization, Support Vector Machines and sparsity, our work 
also illuminates and formalizes the LFA concept of Penev and Atick (1996). We discuss 
the relation between our results, which are about regression, and the different problem of 
pattern classification. 
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1 Introduction 

In supervised learning problems we are given a discrete data set D\ = {(x 8 ,z 8 ) i— > X X Z} 
obtained by sampling N times the set X X Z according to P(x.,z). The goal of learning is 
to provide a deterministic function /(x) which models the relationship between X and Z and 
thereby solves the associated regression problem. 

A specific example that we will use throughout this paper is the regression problem of recon- 
structing a specific image / given its pixel values at discrete locations in the image plane. This 
paper focuses on a special version of this problem, in which prior information is available in terms 
of the correlation function of images of the same type as /. 

We first reformulate known results to show that the classical Principal Component representation 
is associated with a regularization formulation of the problem, in which the stabilizer is defined 
in terms of the correlation function of an ensemble of functions f a of the same type as the / of the 
regression problem. Principal Components thus correspond to a special case of the feature space 
of Vapnik (1995). Regularization provides another dual representation - in Vapnik's language - 
for the regression function in terms of a weighted sum of correlation kernels each centered at a 
data point x 8 -. This dual representation contains a large number of terms if the number of data 
points is large (for instance all pixels in an image). Girosi's results show that it can be sparsihed 
using the SVM formulation (Vapnik, 1995) and that this is equivalent to enforcing a sparsity 
constraint like in Chen, Donoho and Saunders (1995). Regularization, SVM and a special form 
of BPD have a Bayesian interpretation: we show that this equivalence (see Wahba, 1990) can be 
used to justify the use of the correlation function as the kernel. 

We will also discuss how our regression results are related to corresponding classification tasks 
and how the kernels obtained for regression may be used for a pattern recognition problem in a 
SVM classifier, thus providing sparse features for a classification task. 

We first give our reformulation of existing results and then describe our main observations. 
We assume that the reader is familiar with regularization, SVM techniques and sparsihcation 
algorithms (see Girosi, 1998). 

2 Background 

2.1 Reproducing Kernels and Regularization 

Let us hrst summarize the basic results we will need from the theory of regularization. They 
are a special case of the technique discussed by Girosi (1998) and can also be found in Wahba 
(1990). Regularization techniques as developed by us to solve supervised learning problems 
(Poggio and Girosi, 1989; Girosi, Jones, Poggio, 1995) were limited to shift invariant stabilizers 
(tensor product and additive stabilizers are special exceptions, see Girosi et al. 1995): the 
underlying kernel G(x,y) was constrained to be G(x,y) = G(x. — y), strongly limiting - in the 
language of Vapnik (1995)- the type of associated feature representations (the eigenfunctions of 
the associated integral operator are always Fourier basis functions). It is however possible to 
construct kernels of the general form G(x,y) (see Wahba, 1990; Girosi, 1998). 
Consider a positive definite function K(x., y). It is well known that K defines an integral operator 
with a complete system of orthogonal eigenfunctions that can be made orthonormal and ordered 



with decreasing eigenvalue 1 with positive X n 

dy K(x.,y)<f> n (y) = \ n <f> n (x.) (V 



IR d 

and the following series representation that converges absolutely and uniformly: 

oo 

/i(x,y) = ;CA n <Mx)<My) (2) 

71 = 1 

We now define a scalar product in the function space spanned by the system of <f> n and thus 
induced by K , as follows: 

< f,g >K= J2t~ < f'^n >< 9 An > (3) 

With this dehnition K dehnes a Reproducing Kernel Hilbert Space (RKHS) with a corresponding 
regularization functional: 

H[f] = ^h^-f(xi)) 2 + ^\\f\\K (4) 

8 = 1 

where \\f\W = J2 n < ^ n> is the norm in K induced by the scalar product defined earlier. 
Minimization of 4 yields the usual solution in terms of regularization networks 

N 

/( x ) «X>,-#(x,x,-), (5) 

8 = 1 

solving the regression problem of estimating / from the discrete data (x 8 ,z 8 ). As we mentioned 
earlier, the specific example we have in mind is / an image and x a vector in the plane. 

2.2 Vapnik's Feature Space and Regularization 

The previous section implies that any positive definite kernel K induces a RKHS defined by the 
feature vector </>(x) = (\/\~[<f>i(x.), V / A^^ 2 (x), . . . , \/\^<f> n (x.), . . .), with 

</>(*)• </>(y) = ^(x,y) 

As a consequence, a function in the RKHS space spanned by the orthonormal features can be 
represented 2 as 

oo 
/(x) = ^Wn(x) (6) 

71 = 1 

and also approximated in terms of the dual representation (because of the underlying regular- 
ization principle of the previous section) 

N 

/(x) = $>.-#(x,x,-). (7) 

8 = 1 



: We use A n instead of the usual (for integral operators) — 
2 In the discrete case f = $b 



Instead of starting from a given K and derive the feature space we could start from any set of 
orthonormal functions <f> n - our features - with appropriate X n and construct a regularization 
kernel K(x.,y) as 

A(x,y) = </>(x)-</>(y) (8) 



Remarks: 

1. When the <f> n are a finite set, the X n can be arbitrary (finite) numbers, since there are no 
convergence problems. In particular they can all be equal to one. Of course, the choice 
of the <f> n defines the space of functions that can be represented accurately in terms of the 
features. 

2. All translation-invariant stabilizers (A^(x,x 8 ) = K(x. — x 8 )) correspond to Fourier eigen- 
functions and only differ in the spectrum of the eigenvalues (for a Gaussian stabilizer the 
spectrum is Gaussian \ n = Ae^~ n ' 2 > (for a = I)). 

3. In standard regularization with translation invariant stabilizers and associated kernels, the 
common experience, often reported in the literature, is that the form of the kernel does not 
matter much. We conjecture that this may be because all translation invariant K induce 
the same type of <f> n features - the Fourier basis functions. Correlation functions which are 
not translation invariant can dehne instead quite different sets of features which are likely 
to have quite different effects. 

2.3 PCA and Regularization 

Until now we have considered the regression problem of estimating / from discrete data. In our 
example of image reconstruction / would map location x on the image plane to a real value - the 
image value at that location. A limit case of the regression problem is classification in which the 
range of / is {0, 1}. In our image example, classification corresponds to estimating the binary 
value of a pixel at a desired location from (binary) values at sparse locations in the (binary) 
image. 

From now on, we will consider a special case of the regression-classification problem: we will 
assume that in addition to the training data - which are values of the underlying function / at 
discrete locations x 8 - - we also have information about the class of function to which / belongs. 
In particular, we will assume that the underlying correlation function is known. More formally, 
the given / is taken to belong to a set of functions f a over which a probability distribution P(a) 
is defined. In our standard example of / being a specific image, the f a are images of the same 
type, all aligned and registered, for instance images of faces. Then the correlation function of 
the random signal / - of which the f a are realizations - is 

i*(x,y) = £[(/ a (x)/ a (y)] (9) 

where E[-] denotes expectation with respect to P(a). In the following we will always assume 
that the average function is the null function: E[f a (x.)] = 0. 

The correlation function R is positive definite and thus induces a RKHS with the X n defined by the 
eigenvalue problem satisfied by R (Hilbert-Schmidt-Mercer theorems). It follows that R provides 



a "natural" kernel - among the many possible - for solving the regression-classification problem 
from discrete data for / (see section 4). It also provides the standard Principal Components 
representation for / in terms of a (in practice finite) set of M <f) n , n = I,-,Af. The following 
points hold true: 

I. There exists a regularization formulation corresponding to the PC A choice 

N 

H[f] = Y,^-f(xi)) 2 + ^\\f\\R (io) 



8 = 1 



where \\f\\* R = Ef=i ^^ 

2. The regression solution / is in the span of the <f> n and can be represented in terms of M 
Principal Components (with M finite or infinite) as 

M 

/(x) = EWn(x) (11) 

71 = 1 

3. / can be represented in terms of a regularization network as 

N 

/(x) = $>,-i2(x,x,-) (12) 

8 = 1 

Notice that often only an estimate of R is available and that usually this estimate may 
be highly rank deficient (see appendix C). In these cases instead of R, one can use a 
regularization kernel i? M (x,y) defined as the natural approximation of R in the space of 
the available M Principal Components: 

M 

^(x,y) = EUWWy) (13) 

71 = 1 

In the following we will drop the superscript M in our notation. 

The two representations are equivalent (under the same error criteria) when the number of 

principal components is chosen equal to be M. Notice that, unlike the global <f) n , the basis 

functions i?(x,x 8 ) are usually quite local: consider for instance the translation invariant case of 

natural images, where the <f> n are Fourier components, while the correlation is relatively short 

range. 

Notice that in equation 13 one can assume that the only available prior knowledge is which 

M eigenfunctions are relevant. In the case of finite M we can then define several different 

regularization kernels all corresponding to the same PCA decomposition. The most natural 

kernel is simply the projection operator 

M 

P(x,y) = $>*(x).^(y) (14) 

71 = 1 

P plays the role of the 8 function in the space of the <f> n . It has an associated regularization 
formulation (with a stabilizer ||/||p = J2 n =i < fi 't'n > 2 )- Thus / can be also represented as 



N 



/(x) = $>P(x,x,-) (15) 

8 = 1 

Following the spirit of a suggestion by Penev and Atick (1996), we can define generalized corre- 
lation kernels parametrized by d (for M finite) as 



M 

71 = 1 



i? d (x,y) = V(A n ) d ^(x)^(y) (16) 



of which P = R and R = Ri are special cases. It is not completely trivial to notice (following 
Penev and Atick, 1996) that d controls the locality of the kernel. In the shift-invariant case, for 
which the <f> n are Fourier basis functions, d acts as a filter: low-pass for increasing d and high- 
pass for decreasing d. Thus locality increases for decreasing d: for instance when R^ is a Radial 
Basis Gaussian function, d controls directly the effective a of the Gaussian. The most interesting 
values of d range between and 1: i? , which is less smooth than i?i, plays the role of the 8 
function in the space spanned by the <f> n while R^ with negative d are similar to "derivatives" 
of the delta function (consider the example of band-limited functions for which the analog of 
the delta is the sine function). For positive integer d > 1, R^ is a so called iterated kernel: for 
instance i? 2 (x,y) = / o?zi?(x, z)i?(z, y), which indicates that positive d corresponds to integral 
operators (while negative d correspond to differential operators). 
Remarks: 

1. Given a set of <f> n the spectrum X n of the correlation function R depends on the specific 
P(ct) since 

An = E[bl] (17) 

where the b n are the coefficients of the expansion of the function / in the set of eigenfunc- 
tions <f> n . 

2. The stabilizer in the form ||/||fj = J2 n =i Id* has obvious smoothness properties (smooth- 
ness increases with d), since the eigenfunctions (ordered as usual) typically have increasing 
high-frequency content as n increases (theorems in the theory of integral equations, like in 
Courant and Hilbert, 1953, relate the number of nodes or zeros of the eigenfunctions to 
their index n). 

3. Since regularization has a Bayesian interpretation (Kimeldorf and Wahba, 1971; Girosi, 
Jones and Poggio, 1995) we have now a probabilistic interpretation of PCA in terms of a 
prior probability on the space of functions f a given by P(f) = e^~"^'R> and a Gaussian 
model of the noise (see section ?? and Wahba, 1990). 

4. The kernels P and R - and in general Rj - correspond to different prior probabilities (they 
are multivariate Gaussian priors with different covariances). They define, however, the 
same set of basis functions <f> n - Vapnik's features - and are therefore expected to behave 
in a similar way. 

5. There is a relation between equation 12 and kriging (see Wahba, 1990). 



6. The sine function is the translation invariant correlation function of the set of one-dimensional 
band-limited functions with a flat Fourier spectrum up to f c (and zero for higher frequen- 
cies). Perhaps more interestingly it also corresponds to the operator P of the band-limited 
functions with a given cut-off (and with any correlation function). The sine function is a 
positive definite reproducing kernel with negative lobes. 

7. The PC A representation equation f f can be used to solve the problem of regression from 
N discrete data at locations X; by computing the h{ from 

M 

/(x,) = ^M„(x,) /=l,-,iV (18) 

71 = 1 

In general, the equations can be solved if N is at least equal to M. Note that equation 12 
can be used even when N « M and equation 18 cannot be solved. The regularization 
representation equation 12 can be obtained solving for the a n from the data (using a = 
(R + 7/) _1 y with positive or zero 7). 

8. The connection between the a n of equation 12 and b n of equation 11 is given by 

N 

h = XkJ2 a J^k{^j) (19) 



9. As we will see later, estimates of the correlation function may often be possible from a 
sufficient set of examples, even in cases in which the dimensionality of the discretized space 
is very high. 

10. Penev and Atick (1996) remarks that the object P corresponds to local features, similar to 
local receptive fields with compact support. 

11. Wahba (1990) discusses the relation between regularization, RKHS and correlation func- 
tions of Gaussian processes. In particular, / in the RKHS defined by R and / a sample 
function from a zero-mean Gaussian stochastic process are not the same (when R has more 
than a finite number of non-zero eigenvalues). 

3 Sparsification of the regularization representation and 
Support Vector Machines 

Let us consider again the main scenario we sketched above: a space of functions is characterized 
probabilistically through its correlation function i?(x,y). Any function / in the space can be 
represented in terms of the eigenvectors associated with R. An (approximate) representation of 
/ in terms of a finite numbers of Principal Components is a natural compact approximation of 
/. It is natural to ask whether we could sparsify the N-terms dual representation of / in terms 
of a regularization network, that is the weighted sum of the kernels R centered at N data points. 
A natural way to sparsify 

N 

/(x) = $>,-i2(x,x,-) (20) 

8 = 1 




is to use SVM regression (Vapnik, Golowich and Smola, 1997; and Vapnik, 1995 ) with the kernel 
R (or i? M , see later). As shown by Girosi (1998), this corresponds to minimizing -instead of the 
regularization functional 4 - the following functional 

1 N 
H[f] = T7Y,\^-f(Xi)\c+^\\f\\R (21) 

8 = 1 

where the following robust error function has been defined instead of the usual L 2 norm on the 
data term: 

if |x|<e 
-e otherwise. 

The function that minimizes the functional in eq. (21) depends on a finite number of parameters, 
and has in our case the following form: 

N' 

/(x) = $>,-i2(x,x,-), (23) 

8 = 1 

where the coefficients a 8 - are now found by solving a quadratic programming problem. Notice 
that the sum in equation (23) runs only up to N' , where N 1 < N. The reason is that, due to the 
nature of this QP problem, only a "small" number of coefficients will be different from zero, and 
the data points associated to them are called support vectors (in many cases N 1 << A). Thus 
we can sparsify the regularization representation of a function by using the correlation function 
as the regularization kernel in equation 21. 

We now invoke a result in Girosi (1998, section 5) to claim that the result of minimizing equation 
21 is the same as of sparsifying the overcomplete dictionary o/i?(x,x 8 ) using Basis Pursuit De- 
noising (Chen, Donoho and Saunders, 1995) (see also the sparsihcation approaches of Olshausen 
and Field, 1996, and Lewicki and Sejnowski, 1998). The proof consists of applying the Girosi 
version of the Chen-Donoho cost functional to sparsify equation 20, leading to the minimization 
of the following functional with respect to the coefficients a 8 - 

1 N 

E[*] = dl/(x) - l>^(x;x.-)l& + £ ||a|| Ll (24) 

8 = 1 

The solution of equation 24 is the same as the solution of minimizing 21, which is given by 
equation 23. Thus a solution equivalent to the SVM solution - in which only a subset of the 
data points has non-zero coefficients, the so-called support vectors - can be obtained simply by 
enforcing a sparsity constraint in an approximation scheme of the standard regularization form 
with R being the correlation matrix 

N 

/( x ) = J2 a * R (^ x 

8 = 1 

a sparse representation is sought among a "large" number of possibly local and task-dependent 
features i?(x,x 8 ). 

Notice that the framework of sparsihcation (and the equivalent SVM) allows us to consider a 
dictionary of overcomplete basis functions and in particular of R not only at multiple locations 
but also at multiple scales. A natural way to dehne such a dictionary is to consider, instead of 
i?, Rd for several different values of d and, of course, at many locations (for instance at each 

7 



pixel in an image). In this case (see appendix) we minimize the sparsihcation functional to select 
appropriate sparse scales and locations, yielding a sparse, multi-scale representation 

N',D' 

/(x) = ^2 a itd R d (-x,Xi), (25) 



Remarks: 

1. Basis Pursuit Denoising provides only a suboptimally sparse representation from a dic- 
tionary (because it uses HaH^ instead of ||a||i, in equation 24) but it probably has good 
generalization (because in the form of equation 24 it is equivalent to SVM). 

2. The form of the solution - a superposition of kernels - does not depend on the form of the 
norm involved in the data term, as observed earlier by Girosi, Caprile and Poggio (1990). 
In particular, it is the same for the standard L 2 norm and for the robust norm defined by 
Vapnik. 

3. Our approach of sparsifying the representations of / in terms of the generalized correlation 
kernel R d is a principled way to achieve the sparsihcation proposed by Penev and Atick 
(1996). 

4. Though the representations of a function / in terms of R d are all equivalent, independent 
of d, in the standard regularization case, we expect that they will have in general different 
properties after sparsihcation. 

4 Bayesian interpretation and why R is the kernel of 
choice 

Consider 

N 

™%m = Y l (yi-f(xi)) 2 + 'r\\f\\ 2 K 

In the standard bayesian interpretation of RN (see for instance (see Girosi et al., 1995) the data 
term is a model of the noise and the stabilizer is a prior on the regression function /. Informally 
the equation follows from a MAP estimate of 

P(f/y) oc P(y/f)P(f) 

To see the argument in more detail, let us assume that the data j/ 8 - are affected by additive 
independent gaussian noise processes, i.e. j/ 8 - = f(x{) + e 8 - with £^[eiej] = 2<5 8 - J 



P(y//)ocexp(-^(j/ 8 -/(x 8 )) 5 



anc 




P(/)ocexp(- \\f\\ 2 R ) =exp 
where M < oo 

M 

/( x ) = ^c n <^(x). 

n=l 

Thus the stabilizer measures the Malahanobis distance of / from the mean of f a . To see this, 
let us represent / in any complete orthonormal basis ipi as the vector f 8 =< f,ipi >. We assume 
that the data are zero-mean in the sense that E[f a (x.)] = (obviously the data can always be 
processed to satisfy this condition). Then we know that if P(f) is Gaussian then 

P(f) oc expC-f^E)"^ ) 

and f T (E) _1 f is the Malahanobis distance of f from its mean (the origin). P(f) is therefore a 
multivariate Gaussian with zero mean in the Hilbert space of functions defined by R and spanned 
by the <f) n , that is the space related to Principal Components. 
Remarks: 

1. Notice that for SVM the prior is the same Gaussian prior but the model of the noise is 
different and is NOT gaussian additive as in RN (see Pontil et al., 1998 ). The same is true 
for BPD, given the equivalence between SVM and BPD. 

2. Thus also for SVM (regression) and BPD the prior P(f) gives a probability measure to / 
in terms of the Malahanobis distance in the Hilbert space defined by R and identical to 
the space of the Principal Components. 

3. There is a natural probabilistic interpretation of the data term (see Girosi et al., 1995). 
As we have mentioned, in the case of standard regularization, the data term norm (a L 2 
norm) corresponds to a Gaussian model of the noise, that is the conditional probability of 
the data Z{ given the function is a Gaussian. Other norms can be interpreted as shown by 
Girosi, Caprile and Poggio (1990) in probabilistic terms as different models of the noise. 
Pontil et al. (1998) have derived the noise model corresponding to Vapnik's e insensitive 
norm. 

4.1 Why R is the kernel of choice. 

Assume that the problem is to estimate / from sparse data j/ 8 - at location x 8 -. From the previous 
description it is clear that choosing a kernel K is equivalent to assuming a Gaussian prior on / 
with covariance equal to K. Thus an empirical estimate of the correlation function associated 
with a function / should be used, whenever available. Notice that in the Bayesian interpretation 
a Gaussian prior is assumed in regularization as well as in SVM (and in the equivalent BPD 
formulation). Thus when empirical data are available on the statistics of the family of functions 
f a one should check that P(f) is Gaussian and make it zero-mean. Then an empirical estimate 
of the correlation function E[(f a (x.)f a (y)] can be used as the kernel. 

The relation between positive definite kernels and correlation functions R of Gaussian random 
processes is characterized in details in Wahba (1990), Theorem 5.2. 



5 Conclusions 

We know from Wahba (1990) and Girosi (1998) that, given a positive definite K , 

1. a regularization functional can be defined 

2. a function in the RKHS can be represented in terms of the non-linear features provided 
by the orthonormal eigenfunctions of K and also in terms of a linear combination of the 
kernels K evaluated at sparse points 

3. the data term in the regularization functional can be modified to yield a SVM formulation 

4. minimizing the SVM functional is the same as sparsifying the regularization representation, 
that is the dictionary of A^(x,x 8 ). 

Here we consider the case in which the kernel K is a very special "object" - the correlation 
function i?(x, y) and justify this choice in terms of the Bayesian interpretation of regularization, 
SVM and BPD . We focus on its role in regression (function approximation from sparse data) 3 . 
We show that 

1. a function can be represented either by the Principal Components induced by the associated 
correlation function or in a dual way by the regularization solution - a weighted sum of 
correlation kernels evaluated at N data points. 

2. the representation in terms of the correlation kernel can be sparsihed using the SVM 
technique or, in a completely equivalent way, by using the basis pursuit denoising technique 
on the dictionary of i?(x, x 8 ). Notice that this representation is not only compact (see Chen, 
Donoho and Saunders, 1995) but it is also likely to achieve good generalization, (since the 
SVM cost functional implements Vapnik's theory of risk minimization). 

In our case SVM can be therefore regarded as a "sparse" version of a regularization network with 
a kernel derived directly from the correlation function. The regression problem we consider is a 
problem of signal reconstruction; is very different from the problem of pattern classification (see 
Appendix). Following the spirit of Penev and Atick, the same sparsihed kernels computed for 
regression may be used with SVM classifiers - in the same way in which principal components 
are often used - effectively representing a choice of sparse feature from an appropriate large 
dictionary of basis functions (provided by the i?(x,x 8 )). 

Correlation functions that are shift invariant are not very interesting from the point of view 
of the representations discussed here: they all correspond to the same set of Fourier features. 
Of course, the correlation function corresponding to a large set of images of different scenes and 
objects will be translation and scale invariant (see Penev and Atick, 1996 and references therein). 
Properly aligned images of objects of the same type (such as for instance faces or people, see 
Papageorgiou et al., 1998) instead yield correlation functions which are not shift invariant (see 
Sirovich and Kirby, 1988; see also Turk and Pentland, 1990). The associated <f> n features capture 
information about the category of objects. They are however global. The correlation kernels 
i?(x,x 8 ) (or the corresponding P(x,x 8 )), instead yield local "features", which can be sparsihed 



3 Atick and Penev were probably the first to study the correlation function R in the context or regression. 
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and thereby simultaneously optimized for generalization. Results from experiments in progress 
are promising. 

It is suggestive to speculate that cortex may use machinery to align and normalize visual inputs 
so that dictionaries of object specific features can be learned without being affected by arbitrary 
translations and scalings. At earlier stages of the visual system, however, one may expect from 
our results that translation invariant correlation functions associated with non-aligned images of 
different types will determine basis functions similar to local Fourier components. It is interesting 
to speculate that the correlation functions associated with images at different scales may be 
learned separately, providing receptive fields at multiple resolutions. 

Acknowledgments We would like to thank Mike Oren, Amnon Shashua, Alessandro Verri. 
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A Multiple scales correlation kernels for regularization, 
sparsification and SVM 

A.l Multiple scales and classical regularization 

Let us consider here the multiple scale generalized correlation Rj of equation 16. Let us assume 
that /(x) = /i(x) + i*2(x), where fi and f 2 represent the components of / at two different 
scales (the generalization to the case of more than two scales is cumbersome but possible). The 
functional to be minimized is 

N N 

H[f] = £(* - /i(x,)) 2 + r, J2(z t ~ /i(xO - / 2 (x 8 )) 2 + 71H/1I& + 7211/211^ 

8 = 1 8 = 1 

where rj is a positive, small number. The underlying idea is that fi is a coarse approximation to 
the data at one scale, while f 2 is a refinement at a finer scale (/ 2 approximates the residuals of 

/i)- 

A. 2 Multiple scales and sparsification and SVM 

The sparsity functional of Chen et al. can be used to choose a sparse subset from the dictionary 
of basis functions i?d(x,x 8 ), with i and d ranging over a "large" set of locations and scales. 
One possible way of obtaining equation 25 from the SVM technique is the following. We assume 
that /(x) = /i(x) + i*2(x), where fi and f 2 represent the components of / at two different 
scales (the generalization to the case of more than two scale is immediate). The functional to be 
minimized is 

N N 

H[f] = E I *i - /i( x k +vJ2 I * - /i( x - M*i) U +7i||/i|lk + 72||/ 2 ||^ 

8=1 8=1 

where t\ > e 2 , and a rj is a positive, small number. The final result is 

N',D' 

/(x) = Yl Oi,d-Rd(x,Xi), (26) 

i,d 

B Principal Components under Regularization and Spar- 
sification 

As we discussed, minimization of the regularization functional 10 defines a regression function / 
that is in the span of the features space of the Principal Components (that is the eigenfunctions 
of R). As we mentioned the solution / of the regression problem can be represented in terms 
of <f> n and equivalently in terms of i?(x,x 8 ). It is interesting to look at the solution when it is 
expressed in terms of the principal components. We do this in the case in which we have an 
infinite number of data points, which corresponds to the case in which we actually know the 
function we want to approximate. Therefore, we plug the PC representation 
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M 



/(x) = £Mn(x) (27) 

in the regularization functional: 



n=l 



^[/] = ll^(x)-/(x||i 2 + 711/11^ (28) 

where we denoted by g the function we want to approximate. The solution of the minimization 
problem is 

1 + y) 1 <^,flf(x)>. (29) 

Notice that for 7 = we have the usual solution: b n is simply the result of projecting the target 
function <?(x) on the principal component <f> n . For 7 > 0, the effect of regularization is to decrease 
all b n by a factor which depends on the corresponding eigenvalue of the correlation matrix. 
It is interesting to compare these regularization solutions to the sparsihcation and SVM solution 
(which are the same). We consider the minimization with respect to b n of 

£[b] = i||s(x)-/(x|& + e||b|| Ll (30) 

In this case the solution is 

b n = I < g, <p n > -7„e| + + | < g, <p n > +7„e|_ (31) 

where \x\ + (|x|_) is equal to x when x is positive (negative) and equal to otherwise. The 
dependency of b n on < g, cf> n > is plotted in figure (I). In this case, if the principal component n 
has a projection which is too small it is simply not used. The non-zero coefficients are shrunk by 
a factor that depends on the sparsihcation parameter e (and correspondingly on the e insensitive 
norm of SVM) and on the eigenvalue of the correlation matrix. 

Finally, we consider a very different problem: we perform exact sparsification with respect to the 
average reconstruction error over the space of "images" f a rather than with respect to a single 
image. We follow Girosi (1998) and minimize an appropriate functional H, that is 

min#[£] = l -E[\\f a -J2<faAn> M^)Ca\\ 2 ] + e £ &* (32) 

S n n 

where £ n are binary random variables with values in {0,1}, E[-] denotes the expectation with 
respect to P(ct) and the <f> n are the eigenfunctions of R. In this simple case of orthonormal <f> n 
we find that (0(x) is I if x > and otherwise) 

C n = 6(\ n - e) 

Thus only those Principal Components are chosen that correspond to eigenvalues larger that the 
sparsity and SVM parameter e. 
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Figure 1: The value of the coefficient b n as a function of the projection < g, cf> n > of the data g 
on the principal component cf> n . 
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C The discrete case 

It is often believed that the estimation from data of the correlation function R is impossible 
or very difficult, because of the 'difficulty of sampling the space f a . In practice, however, the 
functions f a (as well as cf> n ) must be represented as vectors in a finite dimensional space, albeit 
of possibly very high dimensionality V. Thus, though the correlation matrix R = FF with 
F a VxQ matrix with columns f„ may be highly rank deficient, it is possible to obtain useful 
estimates of it in terms of its M Principal Components, where M < T with T being the number 
of the observations f„, even when T «V . The best technique is to compute the Singular Value 
Decomposition of F, that is F = UDV T where the columns of U are the eigenvectors of FF T , 
the columns of V are the eigenvectors of F T F and the diagonal matrix D contains the singular 
values. Thus an estimate of R can be obtained as the R M of equation 13. 

Assume that values of an unknown vector (say an image) f are given at a discrete set of points 
and that an estimate of the underlying correlation matrix is available as R M . Then f can be 
reconstructed either as 

M 
f x = Yl bi<t>n,x (33) 

71 = 1 

or as 

N' 

f* = £*.<*, (34) 

8 = 1 

D Pattern Classification 

Our standard example in the paper is the problem of image reconstruction from sparse pixel 
values. This is very different from the problem of classifying images, for instance classifying 
whether an image is an image of a face or not. To see this consider the spaces involved: 

1. Image reconstruction. In the case of image reconstruction we would like to approximate 
the map 

/ : K 2 .-> K 

from its values at sparse points in 1Z 2 . The equivalent problem of binary pixel classification 
synthesizes a map 

For solving this problem we could use any positive definite function A^(x, y), such as the 
Gaussian. 

2. Pattern classification. For pattern classification the problem is quite different: we have 
several images, which are vectors of N components (pixels) and each image is associated 
with a binary label. The goal is to learn the map 
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g:K N » {0,1} 

If I replace {0, 1} with 1Z we have the corresponding (difficult !) regression problem. 

The two problems are quite different. They are somewhat related however in the special case 
of problem (I) that we consider in this paper. In this case we have to solve the regression (or 
possibly binary regression) problem for pixels of an image /(x) but we also know the generalized 
correlation function i?(x,y) of the set of similar images f a {x). As we discussed, R provides a 
"natural" choice for the regression kernel K. An estimate of R is given by FF T . In problem 
(2) the input space consist of vectors f (a) that may be related to the functions f a of problem 
(I) by defining each component indexed by x of f(a) as / a (x). One way to solve problem (2) 
(classification or even regression) is to use regularization or SVM with a kernel K(a, f3) equal 
to the dot product, that is K(a,/3) =< f a ,f(s >. The corresponding matrix needed from the 
data is then F F. Obviously, the Q X Q matrix F F and the N X N matrix FF are closely 
related 4 . Notice that in practice it is very difficult if not impossible to estimate empirically 
the correlation function in a classification problem: that is equivalent to estimate the sufficient 
(Gaussian) statistics characterizing the classification functions (in our example on the images 
and not of the images as in the regression case). 
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